Abstract

Image-based sequence recognition has been a long-standing research
topic in computer vision. In this paper, we investigate the problem of
scene text recognition, which is among the most important and
challenging tasks in image-based sequence recognition. A novel neural
network architecture, which integrates feature extraction, sequence
modeling and transcription into a unified framework, is proposed.
Compared with previous systems for scene text recognition, the proposed
architecture possesses four distinctive properties: (1) It is
end-to-end trainable, in contrast to most of the existing algorithms
whose components are separately trained and tuned. (2) It naturally
handles sequences of arbitrary length, involving no character
segmentation or horizontal scale normalization. (3) It is not confined
to any predefined lexicon and achieves remarkable performance in both
lexicon-free and lexicon-based scene text recognition tasks. (4) It
generates an effective yet much smaller model, which is more practical
for real-world application scenarios. The experiments on standard
benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets,
demonstrate the superiority of the proposed algorithm over the prior
art. Moreover, the proposed algorithm performs well in the task of
image-based music score recognition, which evidently verifies its
generality.


1. Introduction 

Recently, the community has seen a strong revival of neural networks,
which is mainly stimulated by the great success of deep neural network
models, specifically Deep Convolutional Neural Networks (DCNN), in
various vision tasks. However, the majority of the recent works related
to deep neural networks have been devoted to detection or
classification of object categories [12, 25]. In this paper, we are
concerned with a classic problem in computer vision: image-based
sequence recognition. In the real world, a stable of visual objects,
such as scene text, handwriting and musical score, tend to occur in the
form of sequence, not in isolation. Unlike general object recognition,
recognizing such sequence-like objects often requires the system to
predict a series of object labels, instead of a single label.
Therefore, recognition of such objects can be naturally cast as a
sequence recognition problem. Another unique property of sequence-like
objects is that their lengths may vary drastically. For instance,
English words can either consist of 2 characters such as “OK” or 15
characters such as “congratulations”. Consequently, the most popular
deep models like DCNN [25, 26] cannot be directly applied to sequence
prediction, since DCNN models often operate on inputs and outputs with
fixed dimensions, and thus are incapable of producing a variable-length
label sequence.

Some attempts have been made to address this problem for a specific
sequence-like object (e.g. scene text). For example, the algorithms in
[35, 8] firstly detect individual characters and then recognize these
detected characters with DCNN models, which are trained using labeled
character images. Such methods often require training a strong
character detector for accurately detecting and cropping each character
out from the original word image. Some other approaches (such as [22])
treat scene text recognition as an image classification problem, and
assign a class label to each English word (90K words in total). This
results in a large trained model with a huge number of classes, which
is difficult to generalize to other types of sequence-like objects,
such as Chinese texts, musical scores, etc., because the numbers of
basic combinations of such kind of sequences can be greater than
1 million. In summary, current systems based on DCNN cannot be directly
used for image-based sequence recognition.

Recurrent neural network (RNN) models, another important branch of the
deep neural networks family, were mainly designed for handling
sequences. One of the advantages of RNN is that it does not need the
position of each element in a sequence object image in both training
and testing. However, a preprocessing step that converts an input
object image into a sequence of image features is usually essential.
For example, Graves et al. [16] extract a set of geometrical or image
features from handwritten texts, while Su and Lu [33] convert word
images into sequential HOG features. The preprocessing step is
independent of the subsequent components in the pipeline, thus the
existing systems based on RNN cannot be trained and optimized in an
end-to-end fashion.

Several conventional scene text recognition methods that are not based
on neural networks also brought insightful ideas and novel
representations into this field. For example, Almazan et al. [5] and
Rodriguez-Serrano et al. [30] proposed to embed word images and text
strings in a common vectorial subspace, and word recognition is
converted into a retrieval problem. Yao et al. [36] and Gordo et
al. [14] used mid-level features for scene text recognition. Though
they achieved promising performance on standard benchmarks, these
methods are generally outperformed by previous algorithms based on
neural networks [8, 22], as well as the approach proposed in this
paper.

The main contribution of this paper is a novel neural network model,
whose network architecture is specifically designed for recognizing
sequence-like objects in images. The proposed neural network model is
named Convolutional Recurrent Neural Network (CRNN), since it is a
combination of DCNN and RNN. For sequence-like objects, CRNN possesses
several distinctive advantages over conventional neural network models:
1) It can be directly learned from sequence labels (for instance,
words), requiring no detailed annotations (for instance, characters);
2) It has the same property of DCNN on learning informative
representations directly from image data, requiring neither
hand-crafted features nor preprocessing steps, including
binarization/segmentation, component localization, etc.; 3) It has the
same property of RNN, being able to produce a sequence of labels; 4) It
is unconstrained to the lengths of sequence-like objects, requiring
only height normalization in both training and testing phases; 5) It
achieves better or highly competitive performance on scene texts (word
recognition) than the prior art [23, 8]; 6) It contains far fewer
parameters than a standard DCNN model, consuming less storage space.

2. The Proposed Network Architecture 

The network architecture of CRNN, as shown in Fig. 1, 
consists of three components, including the convolutional 
layers, the recurrent layers, and a transcription layer, from 
bottom to top. 

At the bottom of CRNN, the convolutional layers automatically extract a
feature sequence from each input image. On top of the convolutional
network, a recurrent network is built for making a prediction for each
frame of the feature sequence output by the convolutional layers. The
transcription layer at the top of CRNN is adopted to translate the
per-frame predictions by the recurrent layers into a label sequence.
Though CRNN is composed of different kinds of network architectures
(e.g. CNN and RNN), it can be jointly trained with one loss function.

Figure 1. The network architecture. The architecture consists of three
parts: 1) convolutional layers, which extract a feature sequence from
the input image; 2) recurrent layers, which predict a label
distribution for each frame; 3) transcription layer, which translates
the per-frame predictions into the final label sequence.


2.1. Feature Sequence Extraction 

In the CRNN model, the component of convolutional layers is constructed
by taking the convolutional and max-pooling layers from a standard CNN
model (fully-connected layers are removed). This component is used to
extract a sequential feature representation from an input image. Before
being fed into the network, all the images need to be scaled to the
same height. Then a sequence of feature vectors is extracted from the
feature maps produced by the component of convolutional layers, which
is the input for the recurrent layers. Specifically, each feature
vector of a feature sequence is generated from left to right on the
feature maps by column. This means the i-th feature vector is the
concatenation of the i-th columns of all the maps. The width of each
column in our settings is fixed to a single pixel.
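The column-wise extraction described above can be illustrated with a minimal sketch in plain Python. The function name `map_to_sequence` and the nested-list layout `[channel][row][column]` are assumptions made for this illustration only; they do not reproduce the custom Torch7 layer used in our implementation.

```python
def map_to_sequence(feature_maps):
    """Turn feature maps laid out as [channel][row][column] into a list of
    column vectors, one per horizontal position, ordered left to right."""
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    sequence = []
    for i in range(width):
        # The i-th vector concatenates the i-th column of every map.
        vector = [feature_maps[c][r][i]
                  for c in range(channels)
                  for r in range(height)]
        sequence.append(vector)
    return sequence


# Two toy maps, each 2 rows high and 3 columns wide.
maps = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
seq = map_to_sequence(maps)  # 3 frames, each of dimension 2 * 2 = 4
```

The sequence length equals the width of the feature maps, so wider input images simply yield longer sequences while the per-frame dimension stays fixed.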

As the layers of convolution, max-pooling, and element-wise activation
function operate on local regions, they are translation invariant.
Therefore, each column of the feature maps corresponds to a rectangular
region of the original image (termed the receptive field), and such
rectangular regions are in the same order as their corresponding
columns on the feature maps from left to right. As illustrated in
Fig. 2, each vector in the feature sequence is associated with a
receptive field, and can be considered as the image descriptor for that
region.



Figure 2. The receptive field. Each vector in the extracted feature 
sequence is associated with a receptive field on the input image, 
and can be considered as the feature vector of that field. 



Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a
cell module and three gates, namely the input gate, the output gate and
the forget gate. (b) The structure of the deep bidirectional LSTM we
use in our paper. Combining a forward (left to right) and a backward
(right to left) LSTM results in a bidirectional LSTM. Stacking multiple
bidirectional LSTMs results in a deep bidirectional LSTM.


Being robust, rich and trainable, deep convolutional features have been
widely adopted for different kinds of visual recognition tasks
[25, 12]. Some previous approaches have employed CNN to learn a robust
representation for sequence-like objects such as scene text [22].
However, these approaches usually extract a holistic representation of
the whole image by CNN, then the local deep features are collected for
recognizing each component of a sequence-like object. Since CNN
requires the input images to be scaled to a fixed size in order to
satisfy its fixed input dimension, it is not appropriate for
sequence-like objects due to their large length variation. In CRNN, we
convert deep features into sequential representations in order to be
invariant to the length variation of sequence-like objects.

2.2. Sequence Labeling 

A deep bidirectional Recurrent Neural Network is built on the top of
the convolutional layers, as the recurrent layers. The recurrent layers
predict a label distribution $y_t$ for each frame $x_t$ in the feature
sequence $x = x_1, \dots, x_T$. The advantages of the recurrent layers
are three-fold. Firstly, RNN has a strong capability of capturing
contextual information within a sequence. Using contextual cues for
image-based sequence recognition is more stable and helpful than
treating each symbol independently. Taking scene text recognition as an
example, wide characters may require several successive frames to fully
describe (refer to Fig. 2). Besides, some ambiguous characters are
easier to distinguish when observing their contexts, e.g. it is easier
to recognize “il” by contrasting the character heights than by
recognizing each of them separately. Secondly, RNN can back-propagate
error differentials to its input, i.e. the convolutional layer,
allowing us to jointly train the recurrent layers and the convolutional
layers in a unified network. Thirdly, RNN is able to operate on
sequences of arbitrary lengths, traversing from starts to ends.

A traditional RNN unit has a self-connected hidden layer between its
input and output layers. Each time it receives a frame $x_t$ in the
sequence, it updates its internal state $h_t$ with a non-linear
function that takes both the current input $x_t$ and the past state
$h_{t-1}$ as its inputs: $h_t = g(x_t, h_{t-1})$. Then the prediction
$y_t$ is made based on $h_t$. In this way, past contexts
$\{x_{t'}\}_{t'<t}$ are captured and utilized for prediction. The
traditional RNN unit, however, suffers from the vanishing gradient
problem [7], which limits the range of context it can store, and adds
burden to the training process. Long Short-Term Memory [18, 11] (LSTM)
is a type of RNN unit that is specially designed to address this
problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and
three multiplicative gates, namely the input, output and forget gates.
Conceptually, the memory cell stores the past contexts, and the input
and output gates allow the cell to store contexts for a long period of
time. Meanwhile, the memory in the cell can be cleared by the forget
gate. The special design of LSTM allows it to capture long-range
dependencies, which often occur in image-based sequences.

LSTM is directional; it only uses past contexts. However, in
image-based sequences, contexts from both directions are useful and
complementary to each other. Therefore, we follow [17] and combine two
LSTMs, one forward and one backward, into a bidirectional LSTM.
Furthermore, multiple bidirectional LSTMs can be stacked, resulting in
a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep
structure allows a higher level of abstraction than a shallow one, and
has achieved significant performance improvements in the task of speech
recognition [17].

In recurrent layers, error differentials are propagated in the opposite
directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation
Through Time (BPTT). At the bottom of the recurrent layers, the
sequence of propagated differentials is concatenated into maps,
inverting the operation of converting feature maps into feature
sequences, and fed back to the convolutional layers. In practice, we
create a custom network layer, called “Map-to-Sequence”, as the bridge
between convolutional layers and recurrent layers.

2.3. Transcription 

Transcription is the process of converting the per-frame predictions
made by RNN into a label sequence. Mathematically, transcription is to
find the label sequence with the highest probability conditioned on the
per-frame predictions. In practice, there exist two modes of
transcription, namely the lexicon-free and lexicon-based
transcriptions. A lexicon is a set of label sequences that prediction
is constrained to, e.g. a spell-checking dictionary. In lexicon-free
mode, predictions are made without any lexicon. In lexicon-based mode,
predictions are made by choosing the label sequence in the lexicon that
has the highest probability.

2.3.1 Probability of label sequence 

We adopt the conditional probability defined in the Connectionist
Temporal Classification (CTC) layer proposed by Graves et al. [15]. The
probability is defined for label sequence $\mathbf{l}$ conditioned on
the per-frame predictions $\mathbf{y} = y_1, \dots, y_T$, and it
ignores the position where each label in $\mathbf{l}$ is located.
Consequently, when we use the negative log-likelihood of this
probability as the objective to train the network, we only need images
and their corresponding label sequences, avoiding the labor of labeling
positions of individual characters.

The formulation of the conditional probability is briefly described as
follows: The input is a sequence $\mathbf{y} = y_1, \dots, y_T$, where
$T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a
probability distribution over the set $L' = L \cup \{\text{-}\}$, where
$L$ contains all labels in the task (e.g. all English characters), and
$L'$ additionally contains a 'blank' label denoted by '-'. A
sequence-to-sequence mapping function $\mathcal{B}$ is defined on
sequence $\pi \in L'^T$, where $T$ is the length. $\mathcal{B}$ maps
$\pi$ onto $\mathbf{l}$ by firstly removing the repeated labels, then
removing the 'blank's. For example, $\mathcal{B}$ maps
“--hh-e-l-ll-oo--” ('-' represents 'blank') onto “hello”. Then, the
conditional probability is defined as the sum of probabilities of all
$\pi$ that are mapped by $\mathcal{B}$ onto $\mathbf{l}$:

$$p(\mathbf{l}|\mathbf{y}) = \sum_{\pi : \mathcal{B}(\pi) = \mathbf{l}} p(\pi|\mathbf{y}), \qquad (1)$$

where the probability of $\pi$ is defined as
$p(\pi|\mathbf{y}) = \prod_{t=1}^{T} y^t_{\pi_t}$, and $y^t_{\pi_t}$ is
the probability of having label $\pi_t$ at time stamp $t$. Directly
computing Eq. 1 would be computationally infeasible due to the
exponentially large number of summation items. However, Eq. 1 can be
efficiently computed using the forward-backward algorithm described
in [15].
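To make Eq. 1 concrete, the following is a minimal sketch in plain Python, not the C++ transcription layer used in our experiments. It implements the mapping B, evaluates Eq. 1 literally by enumerating every path (feasible only for tiny T), and computes the same quantity with the forward recursion over the blank-extended label sequence, which is the forward half of the forward-backward algorithm of [15]. All function names, the three-symbol alphabet and the toy distributions are assumptions made for illustration.

```python
from itertools import groupby, product

BLANK = "-"


def collapse(path):
    """The mapping B: merge repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != BLANK)


def ctc_probability(y, alphabet, target):
    """p(l|y) by the forward recursion; y[t][k] is the probability of
    alphabet[k] at time stamp t."""
    index = {label: k for k, label in enumerate(alphabet)}
    # Extended target: a blank before, between and after the labels.
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = y[0][index[BLANK]]
    if size > 1:
        alpha[1] = y[0][index[ext[1]]]
    for t in range(1, len(y)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * y[t][index[ext[s]]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def brute_force_probability(y, alphabet, target):
    """Eq. 1 evaluated literally: sum over all |L'|^T paths."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(y)):
        if collapse(alphabet[k] for k in path) == target:
            p = 1.0
            for t, k in enumerate(path):
                p *= y[t][k]
            total += p
    return total


alphabet = ["-", "a", "b"]
y = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
p_forward = ctc_probability(y, alphabet, "ab")
p_brute = brute_force_probability(y, alphabet, "ab")
```

In this toy case five paths collapse to "ab" (aab, abb, a-b, -ab, ab-), and both functions sum their probabilities; the recursion does so in time linear in T instead of exponential.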

2.3.2 Lexicon-free transcription 

In this mode, the sequence $\mathbf{l}^*$ that has the highest
probability as defined in Eq. 1 is taken as the prediction. Since there
exists no tractable algorithm to precisely find the solution, we use
the strategy adopted in [15]. The sequence $\mathbf{l}^*$ is
approximately found by
$\mathbf{l}^* \approx \mathcal{B}(\arg\max_{\pi} p(\pi|\mathbf{y}))$,
i.e. taking the most probable label $\pi_t$ at each time stamp $t$, and
mapping the resulting sequence onto $\mathbf{l}^*$.
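This best-path approximation can be sketched in a few lines of Python. The function name `greedy_decode`, the alphabet and the per-frame distributions below are illustrative assumptions rather than part of our implementation.

```python
from itertools import groupby


def greedy_decode(y, alphabet, blank="-"):
    """Take the most probable label per frame, then apply the mapping B
    (merge repeated labels, remove blanks)."""
    best_path = [alphabet[max(range(len(frame)), key=frame.__getitem__)]
                 for frame in y]
    merged = [label for label, _ in groupby(best_path)]
    return "".join(label for label in merged if label != blank)


alphabet = ["-", "a", "b"]
# Per-frame argmax labels: a, a, -, b, b
y = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.10, 0.60],
]
decoded = greedy_decode(y, alphabet)
```

Note that a blank between two identical labels keeps them separate, so the path "a-a" decodes to "aa" while "aa" decodes to "a".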

2.3.3 Lexicon-based transcription 

In lexicon-based mode, each test sample is associated with a lexicon
$\mathcal{D}$. Basically, the label sequence is recognized by choosing
the sequence in the lexicon that has the highest conditional
probability defined in Eq. 1, i.e.
$\mathbf{l}^* = \arg\max_{\mathbf{l} \in \mathcal{D}} p(\mathbf{l}|\mathbf{y})$.
However, for large lexicons, e.g. the 50k-word Hunspell spell-checking
dictionary [1], it would be very time-consuming to perform an
exhaustive search over the lexicon, i.e. to compute Eq. 1 for all
sequences in the lexicon and choose the one with the highest
probability. To solve this problem, we observe that the label sequences
predicted via lexicon-free transcription, described in 2.3.2, are often
close to the ground truth under the edit distance metric. This
indicates that we can limit our search to the nearest-neighbor
candidates $\mathcal{N}_\delta(\mathbf{l}')$, where $\delta$ is the
maximal edit distance and $\mathbf{l}'$ is the sequence transcribed
from $\mathbf{y}$ in lexicon-free mode:

$$\mathbf{l}^* = \arg\max_{\mathbf{l} \in \mathcal{N}_\delta(\mathbf{l}')} p(\mathbf{l}|\mathbf{y}). \qquad (2)$$

The candidates $\mathcal{N}_\delta(\mathbf{l}')$ can be found
efficiently with the BK-tree data structure [9], which is a metric tree
specifically adapted to discrete metric spaces. The search time
complexity of BK-tree is $O(\log |\mathcal{D}|)$, where $|\mathcal{D}|$
is the lexicon size. Therefore this scheme readily extends to very
large lexicons. In our approach, a BK-tree is constructed offline for a
lexicon. Then we perform fast online search with the tree, by finding
sequences whose edit distance to the query sequence is less than or
equal to $\delta$.
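The candidate search can be illustrated with a small BK-tree over the Levenshtein distance. This is a minimal Python sketch, not our C++ implementation; the class and method names and the toy lexicon are assumptions. Each node stores its children keyed by their distance to the node, and the triangle inequality lets the search skip any child whose key lies outside [d - delta, d + delta], where d is the distance from the query to the node.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is a pair (word, {distance: child node})."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, delta):
        """Return all stored words within edit distance delta of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= delta:
                results.append(word)
            # Only subtrees with a key in [d - delta, d + delta] can match.
            for key, child in children.items():
                if d - delta <= key <= d + delta:
                    stack.append(child)
        return sorted(results)


lexicon = ["hello", "help", "hell", "shell", "world", "word"]
tree = BKTree(lexicon)
candidates = tree.search("helo", 1)
```

In the scheme of Eq. 2, the query would be the lexicon-free transcription and the returned candidates would then be ranked by their conditional probability under Eq. 1.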

2.4. Network Training 

Denote the training dataset by
$\mathcal{X} = \{I_i, \mathbf{l}_i\}_i$, where $I_i$ is the training
image and $\mathbf{l}_i$ is the ground truth label sequence. The
objective is to minimize the negative log-likelihood of conditional
probability of ground truth:

$$\mathcal{O} = -\sum_{I_i, \mathbf{l}_i \in \mathcal{X}} \log p(\mathbf{l}_i|\mathbf{y}_i), \qquad (3)$$

where $\mathbf{y}_i$ is the sequence produced by the recurrent and
convolutional layers from $I_i$. This objective function calculates a
cost value directly from an image and its ground truth label sequence.
Therefore, the network can be end-to-end trained on pairs of images and
sequences, eliminating the procedure of manually labeling all
individual components in training images.

The network is trained with stochastic gradient descent (SGD).
Gradients are calculated by the back-propagation algorithm. In
particular, in the transcription layer, error differentials are
back-propagated with the forward-backward algorithm, as described
in [15]. In the recurrent layers, the Back-Propagation Through Time
(BPTT) is applied to calculate the error differentials.

For optimization, we use the ADADELTA [37] to automatically calculate
per-dimension learning rates. Compared with the conventional momentum
[31] method, ADADELTA requires no manual setting of a learning rate.
More importantly, we find that optimization using ADADELTA converges
faster than the momentum method.
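For reference, the per-dimension update rule of ADADELTA [37] can be sketched as follows. This is a plain-Python illustration on a toy quadratic, not our training code; the function names, the value of `eps` and the toy objective are assumptions, while the decay rate of 0.9 matches the setting we use.

```python
import math


def new_state(dim):
    """Running averages of squared gradients and squared updates."""
    return {"sq_grad": [0.0] * dim, "sq_step": [0.0] * dim}


def adadelta_step(x, grad, state, rho=0.9, eps=1e-6):
    """Apply one in-place ADADELTA update to the parameter list x."""
    for i, g in enumerate(grad):
        state["sq_grad"][i] = rho * state["sq_grad"][i] + (1 - rho) * g * g
        # The step size is the ratio of two running RMS values, so no
        # global learning rate has to be chosen by hand.
        step = -(math.sqrt(state["sq_step"][i] + eps)
                 / math.sqrt(state["sq_grad"][i] + eps)) * g
        state["sq_step"][i] = rho * state["sq_step"][i] + (1 - rho) * step * step
        x[i] += step


def loss(x):
    return (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2


def gradient(x):
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]


x = [0.0, 0.0]
state = new_state(len(x))
loss_before = loss(x)
for _ in range(100):
    adadelta_step(x, gradient(x), state)
loss_after = loss(x)
```

Although the two coordinates have gradients of very different magnitude, each one receives its own effective step size, which is the per-dimension behavior referred to above.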

3. Experiments 

To evaluate the effectiveness of the proposed CRNN model, we conducted
experiments on standard benchmarks for scene text recognition and
musical score recognition, which are both challenging vision tasks. The
datasets and settings for training and testing are given in Sec. 3.1,
the detailed settings of CRNN for scene text images are provided in
Sec. 3.2, and the results with the comprehensive comparisons are
reported in Sec. 3.3. To further demonstrate the generality of CRNN, we
verify the proposed algorithm on a music score recognition task in
Sec. 3.4.

3.1. Datasets 

For all the experiments for scene text recognition, we use the
synthetic dataset (Synth) released by Jaderberg et al. [20] as the
training data. The dataset contains 8 million training images and their
corresponding ground truth words. Such images are generated by a
synthetic text engine and are highly realistic. Our network is trained
on the synthetic data once, and tested on all other real-world test
datasets without any fine-tuning on their training data. Even though
the CRNN model is purely trained with synthetic text data, it works
well on real images from standard text recognition benchmarks.

Four popular benchmarks for scene text recognition are used for
performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13),
IIIT 5k-word (IIIT5k), and Street View Text (SVT).

IC03 [27] test dataset contains 251 scene images with labeled text
bounding boxes. Following Wang et al. [34], we ignore images that
either contain non-alphanumeric characters or have fewer than three
characters, and get a test set with 860 cropped text images. Each test
image is associated with a 50-word lexicon which is defined by Wang et
al. [34]. A full lexicon is built by combining all the per-image
lexicons. In addition, we use a 50k-word lexicon consisting of the
words in the Hunspell spell-checking dictionary [1].

Table 1. Network configuration summary. The first row is the top layer.
'k', 's' and 'p' stand for kernel size, stride and padding size
respectively.

Type                 Configurations
-------------------  --------------------------------
Transcription        -
Bidirectional-LSTM   #hidden units: 256
Bidirectional-LSTM   #hidden units: 256
Map-to-Sequence      -
Convolution          #maps: 512, k: 2 x 2, s: 1, p: 0
MaxPooling           Window: 1 x 2, s: 2
BatchNormalization   -
Convolution          #maps: 512, k: 3 x 3, s: 1, p: 1
BatchNormalization   -
Convolution          #maps: 512, k: 3 x 3, s: 1, p: 1
MaxPooling           Window: 1 x 2, s: 2
Convolution          #maps: 256, k: 3 x 3, s: 1, p: 1
Convolution          #maps: 256, k: 3 x 3, s: 1, p: 1
MaxPooling           Window: 2 x 2, s: 2
Convolution          #maps: 128, k: 3 x 3, s: 1, p: 1
MaxPooling           Window: 2 x 2, s: 2
Convolution          #maps: 64, k: 3 x 3, s: 1, p: 1
Input                W x 32 gray-scale image

IC13 [24] test dataset inherits most of its data from IC03. It contains
1,015 cropped word images with ground truths.

IIIT5k [28] contains 3,000 cropped word test images collected from the
Internet. Each image has been associated with a 50-word lexicon and a
1k-word lexicon.

SVT [34] test dataset consists of 249 street view images collected from
Google Street View. From them 647 word images are cropped. Each word
image has a 50-word lexicon defined by Wang et al. [34].

3.2. Implementation Details 

The network configuration we use in our experiments is summarized in
Table 1. The architecture of the convolutional layers is based on the
VGG-VeryDeep architectures [32]. A tweak is made in order to make it
suitable for recognizing English texts. In the 3rd and the 4th
max-pooling layers, we adopt 1 x 2 sized rectangular pooling windows
instead of the conventional squared ones. This tweak yields feature
maps with larger width, hence a longer feature sequence. For example,
an image containing 10 characters is typically of size 100 x 32, from
which a feature sequence of 25 frames can be generated. This length
exceeds the lengths of most English words. On top of that, the
rectangular pooling windows yield rectangular receptive fields
(illustrated in Fig. 2), which are beneficial for recognizing some
characters that have narrow shapes, such as 'i' and 'l'.

The network not only has deep convolutional layers, but also has
recurrent layers. Both are known to be hard to train. We find that the
batch normalization [19] technique is extremely useful for training a
network of such depth. Two batch normalization layers are inserted
after the 5th and 6th convolutional layers respectively. With the batch
normalization layers, the training process is greatly accelerated.

We implement the network within the Torch7 [10] framework, with custom
implementations for the LSTM units (in Torch7/CUDA), the transcription
layer (in C++) and the BK-tree data structure (in C++). Experiments are
carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609
CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained
with ADADELTA, setting the parameter $\rho$ to 0.9. During training,
all images are scaled to 100 x 32 in order to accelerate the training
process. The training process takes about 50 hours to reach
convergence. Testing images are scaled to have height 32. Widths are
proportionally scaled with heights, but are at least 100 pixels. The
average testing time is 0.16s/sample, as measured on IC03 without a
lexicon. The approximate lexicon search is applied to the 50k lexicon
of IC03, with the parameter $\delta$ set to 3. Testing each sample
takes 0.53s on average.

3.3. Comparative Evaluation 

All the recognition accuracies on the above four public datasets,
obtained by the proposed CRNN model and the recent state-of-the-art
techniques including the approaches based on deep models [23, 22, 21],
are shown in Table 2.

In the constrained lexicon cases, our method consistently outperforms
most state-of-the-art approaches, and on average beats the best text
reader proposed in [22]. Specifically, we obtain superior performance
on IIIT5k and SVT compared to [22], achieving lower performance only on
IC03 with the “Full” lexicon. Note that the model in [22] is trained on
a specific dictionary, namely that each word is associated with a class
label. Unlike [22], CRNN is not limited to recognizing a word in a
known dictionary, and is able to handle random strings (e.g. telephone
numbers), sentences or other scripts like Chinese words. Therefore, the
results of CRNN are competitive on all the testing datasets.

In the unconstrained lexicon cases, our method achieves the best
performance on SVT, yet is still behind some approaches [8, 22] on IC03
and IC13. Note that the blanks in the “none” columns of Table 2 denote
that such approaches cannot be applied to recognition without a lexicon
or did not report the recognition accuracies in the unconstrained
cases. Our method uses only synthetic text with word-level labels as
the training data, very different from PhotoOCR [8], which used
7.9 million real word images with character-level annotations for
training. The best performance is reported by [22] in the unconstrained
lexicon cases, benefiting from its large dictionary; however, it is not
a model strictly unconstrained to a lexicon, as mentioned before. In
this sense, our results in the unconstrained lexicon case are still
promising.

Table 3. Comparison among various methods. Attributes for comparison
include: 1) being end-to-end trainable (E2E Train); 2) using
convolutional features that are directly learned from images rather
than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth
bounding boxes for characters during training (CharGT-Free); 4) not
confined to a pre-defined dictionary (Unconstrained); 5) the model size
(if an end-to-end trainable model is used), measured by the number of
model parameters (Model Size, M stands for millions).

Method                         E2E Train  Conv Ftrs  CharGT-Free  Unconstrained  Model Size
Wang et al. [34]               ✗          ✗          ✗            ✓              -
Mishra et al. [28]             ✗          ✗          ✗            ✗              -
Wang et al. [35]               ✗          ✓          ✗            ✓              -
Goel et al. [13]               ✗          ✗          ✓            ✗              -
Bissacco et al. [8]            ✗          ✗          ✗            ✓              -
Alsharif and Pineau [6]        ✗          ✓          ✗            ✓              -
Almazan et al. [5]             ✗          ✗          ✓            ✗              -
Yao et al. [36]                ✗          ✗          ✗            ✓              -
Rodriguez-Serrano et al. [30]  ✗          ✗          ✓            ✗              -
Jaderberg et al. [23]          ✗          ✓          ✗            ✓              -
Su and Lu [33]                 ✗          ✗          ✓            ✓              -
Gordo [14]                     ✗          ✗          ✗            ✗              -
Jaderberg et al. [22]          ✓          ✓          ✓            ✗              490M
Jaderberg et al. [21]          ✓          ✓          ✓            ✓              304M
CRNN                           ✓          ✓          ✓            ✓              8.3M

To further understand the advantages of the proposed algorithm over
other text recognition approaches, we provide a comprehensive
comparison on several properties named E2E Train, Conv Ftrs,
CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.

E2E Train: This column shows whether a text reading model is end-to-end
trainable, i.e. trained without any preprocessing and not through
several separate steps; such approaches are elegant and clean to train.
As can be observed from Table 3, only the models based on deep neural
networks including [22, 21] as well as CRNN have this property.

Conv Ftrs: This column indicates whether an approach uses convolutional
features learned directly from training images, or hand-crafted
features, as the basic representations.

CharGT-Free: This column indicates whether the model can be trained
without character-level annotations. As the input and output labels of
CRNN can be a sequence, character-level annotations are not necessary.

Unconstrained: This column indicates whether the trained model is free
from a specific dictionary; a constrained model is unable to handle
out-of-dictionary words or random sequences.





Table 2. Recognition accuracies (%) on four datasets. In the second
row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None”
denotes recognition without a lexicon. (*[22] is not lexicon-free in
the strict sense, as its outputs are constrained to a 90k dictionary.)

                               IIIT5k               SVT           IC03                        IC13
Method                         50     1k     None   50     None   50     Full   50k    None   None
ABBYY [34]                     24.3   -      -      35.0   -      56.0   55.0   -      -      -
Wang et al. [34]               -      -      -      57.0   -      76.0   62.0   -      -      -
Mishra et al. [28]             64.1   57.5   -      73.2   -      81.8   67.8   -      -      -
Wang et al. [35]               -      -      -      70.0   -      90.0   84.0   -      -      -
Goel et al. [13]               -      -      -      77.3   -      89.7   -      -      -      -
Bissacco et al. [8]            -      -      -      90.4   78.0   -      -      -      -      87.6
Alsharif and Pineau [6]        -      -      -      74.3   -      93.1   88.6   85.1   -      -
Almazan et al. [5]             91.2   82.1   -      89.2   -      -      -      -      -      -
Yao et al. [36]                80.2   69.3   -      75.9   -      88.5   80.3   -      -      -
Rodriguez-Serrano et al. [30]  76.1   57.4   -      70.0   -      -      -      -      -      -
Jaderberg et al. [23]          -      -      -      86.1   -      96.2   91.5   -      -      -
Su and Lu [33]                 -      -      -      83.0   -      92.0   82.0   -      -      -
Gordo [14]                     93.3   86.6   -      91.8   -      -      -      -      -      -
Jaderberg et al. [22]          97.1   92.7   -      95.4   80.7*  98.7   98.6   93.3   93.1*  90.8*
Jaderberg et al. [21]          95.5   89.6   -      93.2   71.7   97.8   97.0   93.4   89.6   81.8
CRNN                           97.6   94.4   78.2   96.4   80.8   98.7   97.6   95.5   89.4   86.7


Notice that though the recent models learned by label em¬ 
bedding [5, 14] and incremental learning [22] achieved 
highly competitive performance, they are constrained to a 
specific dictionary. 

Model Size: This column reports the storage space
of the learned model. In CRNN, all layers have weight-sharing
connections, and fully-connected layers are not
needed. Consequently, CRNN has far fewer parameters
than the models learned on the variants of CNN
[22, 21], resulting in a much smaller model.
Our model has 8.3 million parameters, taking only
33MB RAM (using a 4-byte single-precision float for each
parameter), so it can be easily ported to mobile devices.
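The 33MB figure follows directly from the parameter count. The short sketch below reproduces the arithmetic, assuming that 1 MB means 10^6 bytes; the function name `model_size_mb` is only illustrative and does not come from the paper.

```python
def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage needed for the weights, in megabytes (assumed 1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


# 8.3 million parameters stored as 4-byte single-precision floats.
crnn_params = 8_300_000
print(f"{model_size_mb(crnn_params):.1f} MB")  # 33.2 MB
```

With binary megabytes (2^20 bytes) the same count comes to slightly under 32, so the quoted value is consistent with the decimal convention.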

Table 3 shows the differences among the approaches
in detail, and demonstrates the advantages
of CRNN over the competing methods.

In addition, to test the impact of the parameter δ, we experiment
with different values of δ in Eq. 2. In Fig. 4 we plot the
recognition accuracy as a function of δ. A larger δ results
in more candidates, and thus more accurate lexicon-based transcription.
On the other hand, the computational cost grows
with larger δ, due to longer BK-tree search time as well as
a larger number of candidate sequences to test. In practice,
we choose δ = 3 as a tradeoff between accuracy and
speed.
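To make the tradeoff concrete, the following is a minimal sketch of a BK-tree lookup. It assumes the candidates are the lexicon words within edit distance δ of a lexicon-free prediction, which is what the BK-tree search time in the text suggests. The class `BKTree`, the helper `edit_distance` and the toy lexicon are illustrative names and data, not the authors' implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), with children keyed by distance to the node."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, delta):
        """Return sorted (distance, word) pairs with distance <= delta."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= delta:
                results.append((d, word))
            # Triangle inequality: only subtrees whose edge label k satisfies
            # d - delta <= k <= d + delta can contain a match.
            for k, child in children.items():
                if d - delta <= k <= d + delta:
                    stack.append(child)
        return sorted(results)


lexicon = ["hotel", "hostel", "motel", "house", "horse", "table"]
tree = BKTree(lexicon)
print(tree.search("hotol", 1))  # [(1, 'hotel')]
print(tree.search("hotol", 3))  # larger delta admits more candidates
```

Widening δ enlarges the range of edge labels that must be visited at every node, which is one way to see why both the search time and the number of candidates to score grow with δ.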

3.4. Musical Score Recognition 

A musical score typically consists of sequences of musical
notes arranged on staff lines. Recognizing musical
scores in images is known as the Optical Music Recognition
(OMR) problem. Previous methods often require image
preprocessing (mostly binarization), staff line detection
and individual note recognition [29]. We cast OMR
as a sequence recognition problem, and predict a sequence
of musical notes directly from the image with CRNN. For
simplicity, we recognize pitches only, ignore all chords and
assume the same major scale (C major) for all scores.

Figure 4. Blue line graph: recognition accuracy as a function of
parameter δ. Red bars: lexicon search time per sample. Tested on
the IC03 dataset with the 50k lexicon.

To the best of our knowledge, there exist no public
datasets for evaluating algorithms on pitch recognition. To
prepare the training data needed by CRNN, we collect 2650
images from [2]. Each image contains a fragment of score
containing 3 to 20 notes. We manually label the ground
truth label sequences (sequences of note pitches) for all
the images. The collected images are augmented to 265k
training samples by being rotated, scaled and corrupted with
noise, and by replacing their backgrounds with natural images.
For testing, we create three datasets: 1) “Clean”,
which contains 260 images collected from [2]. Examples
are shown in Fig. 5.a; 2) “Synthesized”, which is created
from “Clean”, using the augmentation strategy mentioned
above. It contains 200 samples, some of which are shown
in Fig. 5.b; 3) “Real-World”, which contains 200 images
of score fragments taken from music books with a phone
camera. Examples are shown in Fig. 5.c.^
Figure 5. (a) Clean musical score images collected from [2]. (b)
Synthesized musical score images. (c) Real-world score images
taken with a mobile phone camera.


Since we have limited training data, we use a simplified
CRNN configuration in order to reduce model capacity.
Different from the configuration specified in Tab. 1,
the 4th and 6th convolution layers are removed, and the
2-layer bidirectional LSTM is replaced by a 2-layer
unidirectional LSTM. The network is trained on the pairs
of images and corresponding label sequences. Two measures
are used for evaluating the recognition performance:
1) fragment accuracy, i.e. the percentage of score fragments
correctly recognized; 2) average edit distance, i.e. the
average edit distance between predicted pitch sequences and
the ground truths. For comparison, we evaluate two commercial
OMR engines, namely Capella Scan [3] and
PhotoScore [4].
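The two measures can be sketched as follows. The sketch assumes that a fragment counts as correct only when the predicted pitch sequence matches the ground truth exactly, and that edit distance is the Levenshtein distance over pitch labels. The function names and the pitch labels such as "C4" are illustrative, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def edit_distance(pred: Sequence[str], truth: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def evaluate(predictions: List[Sequence[str]],
             ground_truths: List[Sequence[str]]) -> Tuple[float, float]:
    """Return (fragment accuracy in %, average edit distance)."""
    if len(predictions) != len(ground_truths) or not ground_truths:
        raise ValueError("need equally sized, non-empty prediction and truth lists")
    n = len(ground_truths)
    exact = sum(list(p) == list(g) for p, g in zip(predictions, ground_truths))
    total = sum(edit_distance(p, g) for p, g in zip(predictions, ground_truths))
    return 100.0 * exact / n, total / n


# Toy data: one exact match, one substitution, one missing note.
preds = [["C4", "D4", "E4"], ["G4", "A4"], ["C5"]]
truths = [["C4", "D4", "E4"], ["G4", "B4"], ["C5", "D5"]]
accuracy, avg_dist = evaluate(preds, truths)
print(f"{accuracy:.1f}%/{avg_dist:.2f}")  # 33.3%/0.67
```

The printed string uses the same “fragment accuracy/average edit distance” layout as the entries of Table 4.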

Table 4. Comparison of pitch recognition accuracies among
CRNN and two commercial OMR systems on the three datasets
we have collected. Performance is evaluated by fragment accuracy
and average edit distance (“fragment accuracy/average edit
distance”).

                    Clean         Synthesized   Real-World
Capella Scan [3]    51.9%/1.75    20.0%/2.31    43.5%/3.05
PhotoScore [4]      55.0%/2.34    28.0%/1.85    20.4%/3.00
CRNN                74.6%/0.37    81.5%/0.30    84.0%/0.30


^We will release the dataset for academic use. 


Tab. 4 summarizes the results. The CRNN outperforms
the two commercial systems by a large margin. The
Capella Scan and PhotoScore systems perform reasonably
well on the Clean dataset, but their performance drops
significantly on synthesized and real-world data. The main
reason is that they rely on robust binarization to detect staff
lines and notes, but the binarization step often fails on
synthesized and real-world data due to poor lighting conditions,
noise corruption and cluttered backgrounds. The CRNN, on
the other hand, uses convolutional features that are highly
robust to noise and distortions. Besides, recurrent layers in
CRNN can utilize contextual information in the score. Each
note is recognized not only by itself, but also with the help of
nearby notes. Consequently, some notes can be recognized by
comparing them with the nearby notes, e.g. contrasting their
vertical positions.

The results show the generality of CRNN, in that
it can be readily applied to other image-based sequence
recognition problems, requiring minimal domain knowledge.
Compared with Capella Scan and PhotoScore, our
CRNN-based system is still preliminary and lacks many
functionalities. But it provides a new scheme for OMR, and
has shown promising capabilities in pitch recognition.

4. Conclusion 

In this paper, we have presented a novel neural network
architecture, called Convolutional Recurrent Neural
Network (CRNN), which integrates the advantages of both
Convolutional Neural Networks (CNN) and Recurrent Neural
Networks (RNN). CRNN is able to take input images of
varying dimensions and produce predictions of different
lengths. It runs directly on coarse-level labels (e.g. words),
requiring no detailed annotations for each individual element
(e.g. characters) in the training phase. Moreover,
as CRNN abandons the fully connected layers used in conventional
neural networks, it results in a much more compact
and efficient model. All these properties make CRNN an
excellent approach for image-based sequence recognition.

The experiments on the scene text recognition benchmarks
demonstrate that CRNN achieves superior or highly
competitive performance, compared with conventional
methods as well as other CNN and RNN based algorithms.
This confirms the advantages of the proposed algorithm. In
addition, CRNN significantly outperforms other competitors
on a benchmark for Optical Music Recognition (OMR),
which verifies the generality of CRNN.

CRNN is a general framework, so it can be
applied to other domains and problems that involve sequence
prediction in images (such as Chinese
character recognition). Further speeding up CRNN and making it more
practical in real-world applications is another direction that
is worth exploring in the future.